library(factoextra) library(pheatmap) library(tidyverse) theme_set(theme_bw())
library(factoextra) library(pheatmap) library(tidyverse) theme_set(theme_bw())
We usually think of high-dimensional data as consisting of multiple measures on a group of samples:
Many scientists traditionally think of high-dimensional data as having parallel, continuous measures:
These may be complemented by a smaller number of “metadata” variables, which may be more diverse in type (count, categorical, etc.):
More and more datasets don’t follow this:
We study the rows (samples) using the columns (measures)
But we can also do the opposite!
print(names(decathlon2)) ## from factoextra pkg
## [1] "X100m" "Long.jump" "Shot.put" "High.jump" ## [5] "X400m" "X110m.hurdle" "Discus" "Pole.vault" ## [9] "Javeline" "X1500m" "Rank" "Points" ## [13] "Competition"
dec_frame <- (decathlon2[1:10]
%>% as_tibble()
%>% rename_all(sub, pattern="^X", replacement="Run_")
%>% mutate_at(vars(contains('Run_')), ~(-1*.))
%>% rename_all(sub, pattern="110m.", replacement="")
%>% mutate_all(~c(drop(scale(.))))
)

Iris pairs plot (iris.R)
dec_mat <- as.matrix(dec_frame) heatmap(dec_mat)
heatmap(t(dec_mat))
pheatmap(cor(dec_mat), cell.width = 10, cell.height = 10)
Better for visualizing groups of events
pheatmap(cor(t(dec_mat)), cell.width = 4, cell.height = 4)
Better for visualizing groups of events
A beautiful decomposition based on the idea that data points are points in a Euclidean space
We can think about the PCA as a decomposition (making observed points from idealized points)
Or we can think about it as minimizing distances:
pca_ath <- prcomp(dec_mat, scale=TRUE) fviz_screeplot(pca_ath)
fviz_pca_ind(pca_ath)
fviz_pca_var(pca_ath)
fviz_pca_var(pca_ath, axes=c(2, 3))
View scores and components:
fviz_pca_biplot(pca_ath)
load <- pca_ath$sdev * pca_ath$rotation pheatmap(load, cluster_cols=FALSE)
WRONG!!!!
load <- pca_ath$sdev * pca_ath$rotation pheatmap(load, cluster_cols=FALSE)
WRONG!!!!
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
mds <- metaMDS(dec_mat, distance="euclidean")
## 'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE
## Run 0 stress 0.175356 <<<<<<< HEAD ## Run 1 stress 0.175356 ## ... Procrustes: rmse 3.75291e-06 max resid 1.004819e-05 ## ... Similar to previous best ## Run 2 stress 0.2029784 ## Run 3 stress 0.175356 ## ... Procrustes: rmse 1.355825e-05 max resid 3.876792e-05 ## ... Similar to previous best ## Run 4 stress 0.2028176 ## Run 5 stress 0.175356 ## ... Procrustes: rmse 4.319605e-05 max resid 0.0001589884 ## ... Similar to previous best ## Run 6 stress 0.1975353 ## Run 7 stress 0.2035142 ## Run 8 stress 0.1787377 ## Run 9 stress 0.2020938 ## Run 10 stress 0.1787376 ## Run 11 stress 0.2146083 ## Run 12 stress 0.2035077 ## Run 13 stress 0.1783895 ## Run 14 stress 0.1975374 ## Run 15 stress 0.1754619 ## ... Procrustes: rmse 0.02261613 max resid 0.08184932 ## Run 16 stress 0.1950854 ## Run 17 stress 0.2028353 ## Run 18 stress 0.1950854 ## Run 19 stress 0.1788135 ## Run 20 stress 0.1787378 ## *** Solution reached======= ## Run 1 stress 0.2081892 ## Run 2 stress 0.1754605 ## ... Procrustes: rmse 0.02238121 max resid 0.08059895 ## Run 3 stress 0.2008099 ## Run 4 stress 0.1755879 ## ... Procrustes: rmse 0.02861238 max resid 0.1113826 ## Run 5 stress 0.1754607 ## ... Procrustes: rmse 0.02245717 max resid 0.0810014 ## Run 6 stress 0.1754595 ## ... Procrustes: rmse 0.02222192 max resid 0.07975674 ## Run 7 stress 0.1783895 ## Run 8 stress 0.1754592 ## ... Procrustes: rmse 0.02204255 max resid 0.07924761 ## Run 9 stress 0.1783895 ## Run 10 stress 0.1754599 ## ... Procrustes: rmse 0.02197895 max resid 0.0791835 ## Run 11 stress 0.2851529 ## Run 12 stress 0.178739 ## Run 13 stress 0.1754606 ## ... Procrustes: rmse 0.02245449 max resid 0.08098683 ## Run 14 stress 0.2005249 ## Run 15 stress 0.1757427 ## ... Procrustes: rmse 0.00833189 max resid 0.03163406 ## Run 16 stress 0.1754607 ## ... Procrustes: rmse 0.02246205 max resid 0.08102683 ## Run 17 stress 0.2008961 ## Run 18 stress 0.2054214 ## Run 19 stress 0.2814632 ## Run 20 stress 0.1754596 ## ... Procrustes: rmse 0.02222927 max resid 0.07981159 ## *** No convergence -- monoMDS stopping criteria: ## 20: stress ratio > sratmax >>>>>>> refs/remotes/origin/gh-pages
plot(mds)
## species scores not available
The biplot is hard (but we can help you if you want to do it)
library(GGally)
## Registered S3 method overwritten by 'GGally': ## method from ## +.gg ggplot2
## ## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr': ## ## nasa<<<<<<< HEAD
The biplot is hard (but we can help you if you want to do it)
library(GGally)
## Registered S3 method overwritten by 'GGally': ## method from ## +.gg ggplot2
## ## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr': ## ## nasa======= >>>>>>> refs/remotes/origin/gh-pages
ggparcoord(dec_frame)<<<<<<< HEAD
library(directlabels)
ath_frame <- (
as.data.frame(t(as.matrix(dec_frame)))
%>% mutate(event=names(dec_frame))
)
ggparcoord(ath_frame)
<<<<<<< HEAD
ggparcoord(ath_frame, groupColumn="event")<<<<<<< HEAD
library(andrews) <<<<<<< HEAD andrews(dec_frame)
andrews(ath_frame)
andrews(ath_frame)
Not implemented; may suffer similar problems
Many of these charts might work if the variables were in a suitable order:
heatmap provides an order
>>>>>>> refs/remotes/origin/gh-pagesother ordering methods take a more holistic view
PCA and other decompositions could also provide an order!
>>>>>>> refs/remotes/origin/gh-pages